Add support for non-UTF-8 strings #46

hanmertens · 2021-02-08T23:29:02Z

This is a start towards allowing arguments that aren't UTF-8. At the moment there's just the groundwork to show that argument parsing is also possible based on &OsStr. It's not finished yet, but I'd like to get some feedback and discuss some design decisions beforehand. The API of gumdrop itself has to change quite a bit, but I think users that just use derive(Options) shouldn't need to notice much of the change.

Related to #15.

What becomes `&OsStr` or `AsRef<OsStr>`?

The parser accepts now AsRef<OsStr> instead of AsRef<str>, but normal string (slices) implement this too, so that isn't too big of a change. All commands, short/long options stay &str, but free arguments and arguments to long options become &OsStr before being parsed in gumdrop_derive.

The functions gumdrop::{parse_args, parse_args_default} currently accept &[String]. I think this either can be kept the same but accepting &[S] where S: AsRef<OsStr> or two new functions gumdrop::{parse_args_os, parse_args_os_default} that accept &[OsString] can be made.

In gumdrop_derive the parsing options based on &str can remain, and the arguments can be converted from &OsStr to &str first, yielding an error if that's not possible (without this PR, this would yield a panic). For this purpose I made a doc-hidden but public function gumdrop::to_str, I'm not sure if that's the best approach. In addition to parsing based on &str, parsing options based on &OsStr can be added (basically the whole point of this PR):

parse(from_os_str = "...") for fn(&OsStr) -> T
parse(try_from_os_str = "...") for fn(&OsStr) -> Result<T, E> where E: Display
parse(from_os_str) uses std::convert::From::from
I don't think parse(try_from_str) has a good OsStr equivalent

Limitations

If available, platform-specific OsStrExt is used to look ahead and parse arguments starting from the beginning. Otherwise, the only API available to get the inner data of the OsStr is to_string_lossy, which checks the entire string for valid UTF-8 and allocates a String if that isn't the case, replacing invalid UTF-8 with the replacement character. This means an option like --valid-utf8=invalid-utf8 cannot be parsed into LongWithArg(&str, &OsStr), because there is no way to shorten the &OsStr to invalid-utf8 while keeping the invalid UTF-8 characters. However, --valid-utf8 invalid-utf8 (with a space instead of equal sign) can be parsed correctly on all platforms, and most platforms (especially in terms of usage) implement a form of OsStrExt.

At the moment I crudely check with cfg(unix) for unix's version of OsStrExt, but this should be extended to other platforms that export the same OsStrExt but not all other unix stuff.

Windows has a different OsStrExt trait, which cannot turn --valid-utf8=invalid-utf8 into LongWithArg(&str, &OsStr), but can turn it into LongWithArg(Cow<str>, Cow<OsStr>). This is because Windows uses 16-bit code points internally, which translates to that you can't shorten a &OsStr, but you can create a new OsString that contains only part of the original &OsStr.

This paves the way for deriving fields in Options structs directly from OsStr, e.g. allowing for non-UTF-8 file paths. Commands and short/long options still need to be valid UTF-8.

This requires Opt::LongWithArg to use Cow (or (Os)String), &OsStr cannot be shortened on Windows. This in turn means the enum cannot be Copy, and should therefore be passed by reference whenever possible.

This does not happen automatically when a later function is generic (e.g. From::from).

Three types of parsing support that have &str counterparts: from_os_str (named and unnamed) and try_from_os_str (named). The fourth type of &str-based parsing (unnamed try_from_str using std::str::FromStr) does not have an &OsStr counterpart.

The same OsStrExt is present on some platforms that are not considered to be in the Unix family. Using more fine-grained cfg-attributes, these platforms can also support --valid-utf8=invalid-utf8. I have verified that it compiles for the x86_64-fortanix-unknown-sgx and wasm32-wasi targets, but not that it runs correctly.

hanmertens · 2021-02-17T22:38:11Z

I think this PR is mostly finished, with the following caveats:

Options like --valid-utf8=invalid-utf8 work on most platforms (Unix, Windows, and some more niche platforms), but not all.
Support for --valid-utf8=invalid-utf8 for Fortanix SGX depends on the cfg_target_vendor feature, which was stabilized in Rust 1.33.0. Currently the gumdrop appears to work with Rust versions as low as 1.31.0, so this would either be raised to 1.33.0 or support for --valid-utf8=invalid-utf8 needs to be dropped for Fortanix SGX.

hanmertens added 7 commits February 8, 2021 23:20

Add preliminary OsStr support

b98be5f

This paves the way for deriving fields in Options structs directly from OsStr, e.g. allowing for non-UTF-8 file paths. Commands and short/long options still need to be valid UTF-8.

Allow for --valid-utf8=invalid-utf8 on Windows

4ac5c7d

This requires Opt::LongWithArg to use Cow (or (Os)String), &OsStr cannot be shortened on Windows. This in turn means the enum cannot be Copy, and should therefore be passed by reference whenever possible.

Coerce Cow<OsStr> to &OsStr explicitly

ab805e4

This does not happen automatically when a later function is generic (e.g. From::from).

Add support for parsing from &OsStr

28c7a1d

Three types of parsing support that have &str counterparts: from_os_str (named and unnamed) and try_from_os_str (named). The fourth type of &str-based parsing (unnamed try_from_str using std::str::FromStr) does not have an &OsStr counterpart.

Fix tests for Rust 1.31.0

aca1463

Add OsStr parsing instructions to documentation

2a6cfab

hanmertens marked this pull request as ready for review February 17, 2021 22:38

jirutka added a commit to jirutka/gumdrop that referenced this pull request Jan 12, 2023

Merge pull request murarth#46 from hanmertens/os-str

85bd7bb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for non-UTF-8 strings #46

Add support for non-UTF-8 strings #46

hanmertens commented Feb 8, 2021

hanmertens commented Feb 17, 2021

Add support for non-UTF-8 strings #46

Are you sure you want to change the base?

Add support for non-UTF-8 strings #46

Conversation

hanmertens commented Feb 8, 2021

What becomes &OsStr or AsRef<OsStr>?

Limitations

hanmertens commented Feb 17, 2021

What becomes `&OsStr` or `AsRef<OsStr>`?